1 Introduction

1.1 Questions to ask

  • What can the time series of achievement histories for a sample of profiles tell us?
  • Can we forecast user engagement?

1.2 Data Details

2 Setup and Read Data

2.1 Create Files Directory

  • Gets the file paths of all CSVs to pull for analysis.
  • Applies transformations so that the last-scraped dates are in a consistent format.
directory_df = create_file_directory()
directory_df = directory_transformations(directory_df)

2.2 Read Manifests

  • Load the full leaderboard of TA for exploratory data analysis.
  • Read the full achievements manifest for later analysis.
lb_df = read.csv("./data/leaderboard/leaderboard.csv")
lb_df = lb_feature_transformations(lb_df)
achievements_manifest = read.csv("./data/manifest/achievements_manifest.csv")

2.3 Read Sample of Gamers

  • We take a sample of 200 profiles using our file directory and order them.
set.seed(196)
rnd_gamer_sample = sample_random_gamers(200, directory_df = directory_df)
rnd_gamer_sample = lapply(rnd_gamer_sample, function(x) x[order(rnd_gamer_sample[[3]])])

3 Transformations

3.1 Achievement Transformations

  • Resolves or removes non-date character entries in the achievement-earned field.
  • Formats the date and adds columns for month, day of year, and ISO week.
  • Creates a column that flags weekend vs. weekday.
rnd_gamer_sample[[1]] = achievement_transform_today(rnd_gamer_sample[[1]], directory_df)
rnd_gamer_sample[[1]] = achievement_transform_yesterday(rnd_gamer_sample[[1]], directory_df)
rnd_gamer_sample[[1]] = achievement_transform_drop_offline(rnd_gamer_sample[[1]])
rnd_gamer_sample[[1]] = achievement_transform_format_dates(rnd_gamer_sample[[1]])
rnd_gamer_sample[[1]] = achievement_transform_extract_dates(rnd_gamer_sample[[1]])
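
The date-derived columns described above can be sketched in base R. This is a minimal illustration, not the notebook's own `achievement_transform_extract_dates()`; the function name `extract_date_parts` and the output column names are hypothetical.

```r
# Hypothetical helper: derive month, day of year, ISO week and a weekend flag
# from a vector of dates. Column names are assumptions for illustration.
extract_date_parts <- function(dates) {
  d <- as.Date(dates)
  data.frame(
    date        = d,
    month       = as.integer(format(d, "%m")),  # 1-12
    day_of_year = as.integer(format(d, "%j")),  # 1-366
    iso_week    = as.integer(format(d, "%V")),  # ISO 8601 week number
    # %u is the ISO weekday (1 = Monday ... 7 = Sunday), independent of locale
    is_weekend  = format(d, "%u") %in% c("6", "7")
  )
}

parts <- extract_date_parts(c("2023-01-01", "2023-01-02"))
print(parts)
```

Note that the ISO week belongs to the ISO year, so the first days of January can fall in week 52 or 53 of the previous year.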

3.2 Game Transformations

  • Removes entries with unattainable values.
  • Extracts hours and minutes into separate columns.
  • Splits app hours from game hours.
rnd_gamer_sample[[2]] = games_transform_drop_bad_titles(rnd_gamer_sample[[2]])
rnd_gamer_sample[[2]] = games_transform_hours(rnd_gamer_sample[[2]])
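
As a rough sketch of the hours-and-minutes extraction, the snippet below parses play-time strings into numeric columns. The input format (for example "12 hrs 30 mins") is an assumption made for illustration, since the raw format is not shown here, and `parse_play_time` is a hypothetical name rather than the notebook's `games_transform_hours()`.

```r
# Hypothetical helper: split strings such as "12 hrs 30 mins" into numeric
# hours and minutes. The string format is an assumption, not the real data.
parse_play_time <- function(x) {
  x <- gsub(",", "", x)  # drop thousands separators such as "1,234 hrs"
  grab <- function(pattern) {
    m <- regmatches(x, regexec(pattern, x))
    # Each element is c(full match, captured number), or empty when absent
    vapply(m, function(p) if (length(p) == 2) as.numeric(p[2]) else 0, numeric(1))
  }
  hours   <- grab("([0-9]+) *hrs?")
  minutes <- grab("([0-9]+) *mins?")
  data.frame(hours = hours, minutes = minutes,
             total_minutes = hours * 60 + minutes)
}

play_time <- parse_play_time(c("12 hrs 30 mins", "45 mins", "3 hrs"))
print(play_time)
```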

3.3 Metrics Preprocessing (Total)

  • Outputs total observations
  • Processes the metrics data frame for all profiles
print(paste("TOTAL OBSERVATIONS:", get_total_observations(rnd_gamer_sample[[1]])))
## [1] "TOTAL OBSERVATIONS: 401773"
metrics_df = process_metrics_df(rnd_gamer_sample, directory_df)

3.4 Frequency Data Preprocessing

  • Intermediate step used later to analyze each profile in the sample.
frequency_dfs = achievement_calculate_frequencies(rnd_gamer_sample)
frequency_combined_df = bind_rows(frequency_dfs, .id = "data_frame_id")
frequency_combined_df$data_frame_id = as.numeric(frequency_combined_df$data_frame_id)

3.5 Daily Achievements Preprocessing

  • Creates a full time series profile for each gamer.
  • Calculates churn and existence.
  • Calculates EIRs for each profile.
da_df = calculate_daily_achievements(frequency_combined_df)
da_df = da_fill_dates(da_df)
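
Building a full time series means giving every calendar day a row, including days with no achievements. A minimal sketch follows, assuming a data frame with hypothetical columns `date` and `achievements`; the notebook's `da_fill_dates()` may differ, for instance by leaving gaps as NA instead of filling them with zero.

```r
# Hypothetical helper: expand a sparse daily table to one row per calendar day.
# Filling gaps with 0 is an assumption for illustration.
fill_missing_dates <- function(daily) {
  all_days <- data.frame(date = seq(min(daily$date), max(daily$date), by = "day"))
  filled <- merge(all_days, daily, by = "date", all.x = TRUE)  # left join on date
  filled$achievements[is.na(filled$achievements)] <- 0
  filled
}

daily <- data.frame(
  date = as.Date(c("2023-03-01", "2023-03-02", "2023-03-05")),
  achievements = c(3, 1, 2)
)
filled <- fill_missing_dates(daily)
print(filled)
```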

da_profiles = da_split_by_profile(da_df)
da_profiles = da_profiles_set_churn(da_profiles)
## [1] "PROFILE: 150 DROPPED (All NA values)"
da_profiles = da_profiles_set_days_existence(da_profiles)
da_profiles = calculate_daily_lt_eir(da_profiles)
da_profiles = calculate_weekly_eir_all(da_profiles)
da_profiles = calculate_monthly_eir_all(da_profiles)
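
The churn flag can be sketched as a simple date comparison, using the 365-days-since-last-achievement definition that appears in the churn plot title later in this document. The function name `is_churned` and the choice of reference date are assumptions; the notebook's `da_profiles_set_churn()` is not shown here.

```r
# Hypothetical helper: a profile counts as churned when more than `window_days`
# have passed between its last achievement and the reference date.
is_churned <- function(last_achievement, as_of, window_days = 365) {
  days_inactive <- as.numeric(as.Date(as_of) - as.Date(last_achievement))
  days_inactive > window_days  # NA propagates when the last date is missing
}

as_of <- as.Date("2023-06-01")  # assumed reference date, e.g. the scrape date
last_seen <- as.Date(c("2022-01-01", "2023-05-01", NA))
churned <- is_churned(last_seen, as_of)
print(churned)
```

Profiles with no achievement date stay NA rather than being forced to TRUE or FALSE, which keeps them visible as a separate group in later plots.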

4 EDA (Exploratory Data Analysis)

4.1 Leaderboard EDA

4.1.1 Leaderboard Frequency Plot

  • Outputs the frequency of score ranges for the entirety of the leaderboard.
  • Interactive: the user can choose the range of values to display.
plot_lb_range_interactive(lb_df, "Score", 0, 4000000, 1000000)
Leaderboard Interactive Histogram


4.2 Profile EDA

4.2.1 Frequency Plots by Profile

  • The user can select the profile and the temporal metric.
  • Note: This Shiny app won’t display in the self-contained HTML file. To interact with the app, you can run the RMD document in an R Markdown viewer or in the RStudio IDE.

4.3 Metrics EDA

4.3.1 Churned @ 365 Days Histogram

  • Most users in this sample (approx. 75%) have not churned by this definition.
# Plot histogram of churned with different colors for TRUE, FALSE, and NA
ggplot(metrics_df, aes(x = churned, fill = factor(churned))) +
  geom_bar(color = "white") +
  scale_fill_manual(values = c("FALSE" = "darkgreen", "TRUE" = "darkred"), na.value = "gray") +
  labs(title = "Churned Histogram (365 Days Since Last Achievement)", x = "Churned Status", y = "Count")

4.3.2 Longest Streak Histogram

  • For most users, the longest streak is 4 or 5 days.
  • The distribution of longest streaks in this sample is roughly normal.
ggplot(metrics_df, aes(x = longest_streak, fill = factor(longest_streak))) +
  geom_bar(color = "white") +
  labs(title = "Streak Histogram", x = "Longest Streak (in Days)", y = "Count")
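
For reference, a longest streak can be computed as the longest run of consecutive calendar days with at least one achievement. The sketch below shows one way to do that in base R; `longest_streak_days` is a hypothetical name, and the notebook's own calculation inside `process_metrics_df()` is not shown here and may differ.

```r
# Hypothetical helper: length of the longest run of consecutive calendar days.
longest_streak_days <- function(dates) {
  days <- sort(unique(as.Date(dates)))
  if (length(days) == 0) return(0L)
  # A new run starts wherever the gap to the previous active day is not 1 day
  run_id <- cumsum(c(TRUE, as.numeric(diff(days)) != 1))
  max(tabulate(run_id))
}

active_days <- c("2023-01-01", "2023-01-02", "2023-01-03",
                 "2023-01-05", "2023-01-06")
streak <- longest_streak_days(active_days)
print(streak)
```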

4.3.3 Game Time Box Plot

  • Most players have game time in the thousands of hours, with several outliers above 10,000 hours.
  • This plot only shows Xbox One and Series X|S titles.
# Create the box plot for game time
ggplot(metrics_df, aes(x = "", y = total_game_time_minutes / 60, fill = "Game Time")) +
  geom_boxplot(width = 0.5, position = position_dodge(width = 0.9), color = "black", outlier.color = "darkred", outlier.shape = 16, outlier.size = 3) +
  labs(x = "", y = "Game Time (Hours)", fill = "") +
  scale_fill_manual(values = "#FF7F00") +
  theme(legend.position = "top", legend.title = element_blank()) +
  scale_y_continuous(labels = scales::comma) +
  coord_flip()

4.3.4 App Time Box Plot

  • We filter out the 138 profiles with zero app time (users who don’t use apps on Xbox).
  • Of the 62 players who use apps on Xbox, most sit at or below 2,000 hours. This suggests that users with significant app time on their profile do use Xbox for the tracked apps.
  • This plot only shows Xbox One and Series X|S titles.
# Create the box plot
ggplot(metrics_df[metrics_df$total_app_time_minutes > 0,], aes(x = "", y = total_app_time_minutes / 60, fill = "App Time")) +
  geom_boxplot(width = 0.5, position = position_dodge(width = 0.9), color = "black", outlier.color = "darkblue", outlier.shape = 16, outlier.size = 3) +
  labs(x = "", y = "App Time (Hours)", fill = "", caption = paste("Number of Zero Values Filtered Out:", sum(metrics_df$total_app_time_minutes == 0))) +
  scale_fill_manual(values = "#1F78B4") +
  theme(legend.position = "top", legend.title = element_blank()) +
  scale_y_continuous(labels = scales::comma) +
  coord_flip()

4.3.5 Game vs App Time Scatter Plot

  • Most players have no logged app time, regardless of game time. In this sample, that suggests most players consume app content outside of Xbox.
ggplot(metrics_df, aes(x = total_game_time_minutes / 60, y = total_app_time_minutes / 60, color = total_app_time_minutes / 60)) +
  geom_point() +
  labs(x = "Total Game Time (Hours)", y = "Total App Time (Hours)", color = "Total App Time (Hours)") +
  scale_color_gradient(low = "blue", high = "red") +
  ggtitle("Total Time: Game vs App (Hours)") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)